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Abstract 


This  paper  presents  a  new  class  of  methods  for  solving  unconstrained  optimization  problems  on 
parallel  computers.  The  methods  are  intended  to  solve  small  to  moderate  dimensional  problems  where 
function  and  derivative  evaluation  is  the  dominant  cost  They  utilize  multiple  processors  to  evaluate  the 
function,  (finite  difference)  gradient  and  a  portion  of  the  finite  difference  Hessian  simultaneously  at  each 
iterate.  We  introduce  three  types  of  new  methods,  which  all  utilize  the  new  finite  difference  Hessian 
information  in  forming  the  new  Hessian  approximation  at  each  iteration;  they  differ  in  whether  and  how 
they  utilize  the  standard  secant  information  from  the  current  step  as  well.  We  present  theoretical  analyses 
of  the  rate  of  convergence  of  several  of  these  methods.  We  also  present  computational  results  which 
illustrate  their  performance  on  parallel  computers  when  function  evaluation  is  expensive. 


1.  Introduction. 


This  paper  presents  a  new  class  of  methods  for  solving  unconstrained  optimization  problems  on 
parallel  computers.  The  methods  are  intended  to  solve  small  to  moderate  dimensional  problems  where 
function  and  derivative  evaluation  is  the  dominant  cost  They  utilize  multiple  processors  to  evaluate  the 
function,  (finite  difference)  gradient,  and  a  portion  of  the  finite  difference  Hessian  simultaneously  at  each 
iterate.  We  present  theoretical  analyses  of  the  rate  of  convergence  of  several  of  these  methods.  We  also 
present  computational  results  which  illustrate  their  performance  cm  parallel  computers  when  function 
evaluation  is  expensive. 

The  unconstrained  optimization  problem  is 

min /:*"-►*  (U) 

where  /(x)  is  assumed  to  be  at  least  twice  continuously  differentiable.  This  problem  occurs  commonly 
in  many  applications,  including  modeling,  data  fitting,  and  planning  calculations  in  most  areas  of  science 
and  engineering.  This  paper  is  solely  concerned  with  finding  a  local  minimizer  of/ (x),  the  lowest  point 
of  /  CO  in  some  open  neighborhood  of  the  variable  space.  This  is  the  most  common  unconstrained 
optimization  calculation  in  practice.  For  a  discussion  of  parallel  methods  for  global  optimization,  fire 
problem  of  finding  the  lowest  among  multiple  local  minimizes  off  (x),  see  Byrd.  Dert,  Rinnooy  Kan. 
and  Schnabel  (1986). 

Unconstrained  optimization  problems  often  are  expensive  to  solve.  One  reason  is  that  the  objective 
function,  /(x),  often  is  itself  a  complex  computer  code,  for  example  the  solution  of  a  system  of  partial 
differential  equations.  It  is  not  unusual  for  each  evaluation  of  /  (x )  to  require  many  seconds,  or  minutes, 
on  a  powerful  computer.  In  addition,  in  many  instances  when  /  (x)  is  expensive,  the  derivatives  of  fix) 
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I 

are  not  available  analytically.  In  this  case,  optimization  codes  approximate  the  gradient  of  /  (x)  at  a  point 
%  by  using  the  finite  difference  approximation 

V/Cxe),  -  /C*e  (12) 

where  A,  is  a  small  stepsize  and  et  denotes  the  frt  unit  vector.  This  means  that  each  gradient  evaluation 
requires  n  function  evaluations  in  addition  to /CO.  Since  the  solution  to  the  optimization  problem  usu¬ 
ally  requires  many  evaluations  off  (x)  and  V/  (r),  it  becomes  an  expensive  process.  Usually,  no  higher 
derivatives  are  used  by  optimization  algorithms  when  /  (x)  is  expensive. 

If  the  number  of  variables  is  not  too  large,  say  it  $100,  then  the  time  required  by  the  remainder  of 
the  optimization  algorithm  often  is  insignificant  in  comparison  to  the  time  required  for  the  function  and 
gradient  calculations.  This  is  the  class  of  problems  we  consider  in  this  paper.  We  orient  our  discussions 
to  the  common  case  when  gradients  are  calculated  by  finite  differences.  We  will  point  out,  however,  that 
our  techniques  can  also  be  applied  to  instances  where/(x)  is  expensive  but  the  analytic  gradient  is  avail¬ 
able. 

Methods  for  solving  unconstrained  optimizations  problems  with  a  small  or  moderate  number  of 
variables  on  sequential  computers  are  quite  well  understood  (see  e.g.  Fletcher  (1980),  Gill.  Murray,  and 

j  Wright  (1981),  or  Dennis  and  Schnabel  (1983)).  When  second  derivatives  are  available  analytically  or 

! 

affordable  by  finite  differences,  variants  of  Newton's  method  are  used.  When  analytic  second  derivatives 
are  unavailable  and  function  evaluation  is  expensive,  variants  of  the  BFGS  method  are  most  commonly 

t 

!  used,  using  either  analytic  or  finite  difference  gradients.  A  very  brief  description  of  such  methods  is 

included  in  Section  2.  The  Newton's  method  based  algorithms,  which  are  locally  quadratically  convcr- 

f 

j  gent  on  most  problems,  generally  require  fewer  iterations  than  the  superlineariy  convergent  BFGS  based 

methods,  but  they  generally  require  more  function  evaluations  and  hence  more  computer  time  on 
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problems  where  function  evaluation  is  expensive.  It  will  be  seen  that  our  new  parallel  methods  try  to 
incorporate  advantages  of  both  approaches. 

Due  to  tiie  high  cost  of  solving  many  optimization  problems,  there  is  ample  incentive  to  devise 
methods  for  solving  them  on  parallel  computers  if  they  can  lead  to  significantly  foster  or  more  cost  effec¬ 
tive  solution  of  these  problems.  For  problems  with  expensive  function  evaluations,  there  are  two  obvious 
types  of  approaches.  One  can  use  standard  sequential  optimization  methods  but  apply  a  parallel  algo¬ 
rithm  to  evaluate /(x),  or  one  can  devise  methods  that  make  effective  use  of  evaluating  /(x)  at  multiple 
points  concurrently.  In  this  paper  we  consider  the  latter  approach.  The  former  approach,  applying  a 
parallel  algorithm  to  evaluate  /  (x ).  is  dependent  upon  the  actual  objective  function  /  (x )  and  is  not  under 
the  control  of  the  optimization  algorithm  designer.  It  should  be  noted,  however,  that  the  two  approaches 
often  are  quite  compatible.  For  example,  in  cases  where  the  evaluation  of /(x)  vectorizes  well,  a  com¬ 
puter  which  consists  of  multiple  vector  processors  would  allow  multiple  evaluations  of  /(x)  to  be  per¬ 
formed  concurrently  with  each  evaluation  performed  by  a  vector  processor. 

The  concurrent  evaluation  of  /(x)  at  multiple  points  is  well  suited  to  any  computer  that  can  exe¬ 
cute  multiple,  different  instruction  streams  concurrently.  Such  machines  are  known  as  Multiple  Instruc¬ 
tion  Multiple  Data  (MIMD)  computers.  This  class  includes  both  shared  memory  multiprocessors,  and 
local  memory  multiprocessors  such  as  hypercubes;  the  algorithms  we  discuss  are  well  suited  to  any  such 
computer.  Our  approach  is  not  generally  suited  to  Single  Instruction  Multiple  Data  (SIMD)  computers, 
such  as  processor  arrays,  whose  processors  can  execute  the  same  instruction  on  different  data  in  lockstep. 
This  is  because  different  evaluations  of  an  expensive  function  /(x)  usually  entail  different  sequences  of 
instructions,  due  to  data  dependent  branches  in  the  code  for  /CO,  and  thus  cannot  easily  be  performed 
concurrently  on  an  SIMD  machine. 
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The  most  obvious  way  to  parallelize  unconstrained  optimization  algorithms  when  the  evaluation  of 
/(x)  is  expensive  and  the  gradient  is  calculated  by  finite  differences  is  to  perform  the  n  function  evalua¬ 
tions  required  by  the  finite  difference  gradient  in  parallel.  (If  the  number  of  processors,  p,  is  less  than  n , 
then  f  ft/p]  groups  of p  function  evaluations  each  can  be  performed  in  parallel.)  Simple  approaches  along 
these  lines  are  discussed  in  Schnabel  (1986).  He  demonstrates  that  if  one  also  incorporates  the  technique 
of  always  evaluating  the  finite  difference  gradient  when  tire  function  value  at  a  new  trial  point  is  being 
evaluated  (orp-1  components  of  the  finite  difference  gradient  if  p  <n+l),  before  it  is  known  whether  this 
point  will  be  accepted  as  an  iterate  and  its  gradient  needed,  then  one  can  make  very  efficient  use  of  up  to 
ri+1  processors  on  most  problems  with  expensive  functions.  We  refer  to  this  technique  as  speculative 
gradient  evaluation. 

Such  algorithms,  which  evaluate  /(x)  and  the  n  function  evaluations  for  the  finite  difference  gra¬ 
dient  concurrently,  can  utilize  at  most  n+1  processors  (assuming  that  each  evaluation  of  /  (x)  uses  only 
one  processor).  In  this  paper  we  consider  strategies  that  would  be  appropriate  when  p  >n+ 1.  Since  there 
are  many  unconstrained  optimization  problems  where  the  evaluation  of  /(x)  is  very  expensive  but  the 
number  of  variables  is  small,  say  n  £25,  and  since  many  parallel  computers  already  have  scores  or  hun¬ 
dreds  of  processors,  this  is  a  reasonably  common  scenario.  It  can  be  expected  to  become  even  more  com¬ 
mon  as  the  number  of  processors  in  MIMD  computers  grows.  It  will  be  seen  that  our  strategies  can  also 
be  applied  to  computers  with  fewer  than  n  processors  in  the  case  when  /(x)  and  V/(x)  are  both 
evaluated  analytically  by  one  processor. 

Ifp2(n2  +  5n  +  2)12 ,  then  it  is  possible  to  evaluate  the  function,  the  finite  difference  gradient,  and  a 
finite  different  Hessian  approximation  simultaneously,  and  the  best  strategy  is  probably  just  a  standard 
Newton’s  method  based  algorithm.  A  very  likely  situation,  however,  is  that 
p  c  (n+1).  (n2  +  5n  +2)/2),  so  that  there  are  more  than  enough  processors  to  evaluate  the  finite 
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difference  gradient  but  not  enough  to  calculate  the  full  finite  difference  Hessian  as  well  For  example,  if 
P  *  64,  any  problem  with  n  e  [10, 63]  falls  in  this  class.  This  is  the  main  situation  we  consider  in  this 
paper. 

The  remainder  of  this  paper  considers  ways  to  use  multiple  processors  to  evaluate  the  function, 
(finite  difference)  gradient,  and  a  portion  of  the  information  comprising  the  finite  difference  Hessian  at 
each  iteration.  In  Section  2  we  propose  a  number  of  possible  ways  to  do  this.  They  are  based  upon  using 
the  extra  processors  to  calculate  extra  (finite  difference)  gradients  which  determine  V3/  (xc)u,  in  carefully 
chosen  directions  u,  ,  and  incorporating  this  information  either  by  overwriting  part  of  the  Hessian  approxi¬ 
mation  or  by  secant  updates.  Some  of  our  algorithms  also  incorporate  the  standard  secant  equation  into 
the  Hessian  approximation.  In  Section  3  we  analyze  the  local  convergence  of  some  of  these  methods. 
Several  are  shown  to  be  m-step  quadratically  convergent,  for  p  large  enough  that  n  extra  gradients  can  be 
calculated  in  the  course  of  m  steps.  In  Section  4  we  discuss  several  considerations  involved  in  the  imple¬ 
mentation  of  these  algorithms.  Section  3  contains  computational  results  of  running  many  of  these 
methods  on  standard  unconstrained  optimization  test  problems.  While  the  results  are  obtained  on  a 
sequential  computer,  they  are  easily  used  to  show  what  speedup  would  be  obtained  on  parallel  computers 
when  function  evaluation  is  the  overriding  cost  In  Section  6  we  summarize  our  conclusions  and  briefly 
discuss  our  plans  for  continuing  this  research. 

We  conclude  this  section  by  giving  some  notation  dial  we  will  use  in  this  paper. 
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Notation 

Let  g  (x )  *  Vf  (x )  and  H  (x )  *  V3/  (x ) . 

Define  II  II  to  be  the  Euclidean  norm  II  1 1 2,  and  let  tc04 )  =  I  \A  1 1  I  \A~]  1 1  for  any  non-singular 
nYji  real  matrix  A. 


For  any  nxn  real  symmetric  matrix  B ,  let  Xi(B ),...,  X*  (£)  be  the  n  real  eigenvalues  of  B ,  in  increasing 
order. 


2.  Algorithms. 


hi  this  section  we  describe  a  number  of  alternative  algorithms  for  unconstrained  optimization  in  a 
parallel  processing  environment  They  include  straightforward  parallelizations  of  the  usual  BFGS  and 
Newton’s  methods,  as  well  as  a  number  of  new  algorithms.  All  of  these  algorithms  may  be  regarded  as 
interpolations  between  Newton’s  method  and  a  quasi-Newton  method.  Fust  we  present  a  basic  algo¬ 
rithmic  framework  for  all  these  methods  (Algorithm  2.1),  and  then  we  motivate  and  describe  several 
specific  algorithms  as  versions  of  this  basic  framework. 

AH  our  algorithms  are  intended  fOT  the  case  when  function  evaluation  is  expensive.  We  orient  our 
discussion  in  this  section  to  the  case  when  the  evaluation  of  g  (x)  is  by  the  finite  difference  formula  (1.2), 
so  that  it  requires  n  function  evaluations,  plus  the  evaluation  of  /  (x).  Our  algorithms  will  perform  these 
function  evaluations  concurrently.  For  simplicity,  we  assume  that  the  number  of  processors,  p ,  is 

(?+lXn+l),  forsomeq  2t  0,  and  that  m  *  ~  is  an  integer.  Our  algorithms  can  also  be  applied  to  the  case 

where  the  gradient  is  available  analytically  and  each  component  g,-(x)  can  be  evaluated  by  a  separate  pro¬ 
cessor,  in  this  case,  they  require  p  *  n  (q  + 1)  + 1  processors.  They  could  also  be  used  in  a  case  where 
g  (x)  and / (x)  are  evaluated  on  the  same  processor  and  p  *  q  + 1  processors  are  available. 


Algorithm  2.1. 


0)  Let  a  €  (0,  Vi),  p  €  (a,  1),  xi  e  Rm,  q  Z  0,  B0  be  a  positive  definite,  symmetric  matrix,  and  k  :=  1. 


1)  Evaluate  / 1  :*/  (at  0.  g  1  :*  g  (x  !>,  and  possibly  other  values  in  parallel. 

2)  Determine  Bt. 
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3)  Usingfl*,  compute  search  direction  4k. 

4)  Determine  the  set  Ut+\  of  q  finite  difference  directions. 

3)  Set  steplength  jt*  1. 

6)  *+M*. 

7)  Compute  /  (x*+i),  g  (x*«.i),  and  V*+j  =  finite  difference  approximation  to  H  (x*+i)U*+i  in  parallel. 

(requires  q  additional  gradient  values) 

8)  If  (x*+i  does  not  satisfy  conditions  (2.1)  and  (22) )  then  adjust  p*  and  go  to  6). 

9)  Determine  S*+i. 

10)  k  :=*  +  1. 

11)  Stop  or  go  to  3). 


V 

k 


Note  that  this  algorithmic  framework  includes  the  standard  methods,  in  that  it  allows  B*+ x  to  be  set 
to  the  values  chosen  by  Newton's  method  or  standard  secant  methods  in  step  9).  However,  computing 
additional  information  in  step  7)  and  using  this  information  in  step  9)  results  in  some  new  algorithms  that 
can  take  advantage  of  the  ability  to  perform  computations  in  parallel.  Steps  1),  2),  3),  4),  7),  and  9)  are 
left  unspecified,  and  by  making  specific  choices  in  these  steps  we  will  specify  the  new  methods. 


The  key  issue  is  that  additional  function  evaluations  can  be  done  simultaneously  with  the  evalua¬ 
tion  of/(x)  and  g(x)  at  step  7),  if  sufficiently  many  processors  are  available.  One  of  the  main  ideas  of 
die  new  algorithms  is  to  utilize  the  extra  processors  to  improve  the  Hessian  approximation  at  the  iterate 
x*+i.  In  step  4),  the  option  is  available  to  pick  directions  U*+i  from  x*+i  along  which  extra  gradient  infor¬ 
mation  will  be  computed  in  step  7).  Note  that  these  finite  difference  directions.  £/*«.j,  must  be  chosen 


without  knowledge  of  /(x*+i)  or  g(x*+i).  Then,  in  step  7),  the  gradients  g(x**i  +  h,  [/*♦)«,),  for 


■  .  » 


i  *  1 . q,  where  hi  is  a  small  stepsize.  may  be  approximated  from  function  evaluations.  These  gra¬ 

dient  evaluations,  which  require  q  (n  +  1)  additional  function  evaluations,  are  used  to  determine  a  finite 
difference  approximation  to  V*+i«#f(xt+))U*+i  as  detailed  in  Section  4. 

The  question  of  how  to  utilize  the  extra  gradient  information  obtained  in  step  7)  to  help  form  B±+i 
is  left  unspecified  in  step  9).  This  step,  combined  with  the  choice  of  finite  difference  directions  in  step  4), 
is  die  crucial  part  of  the  new  algorithms.  Note  that  steps  1)  and  7)  are  the  only  locations  in  Algorithm  2.1 
at  which  function  evaluations  are  performed,  except  for  Newton’s  method,  which  also  performs  some 
function  evaluations  in  step  9).  Step  1)  provides  for  the  possibility  of  doing  extra  function  evaluations,  in 
parallel  with  the  computation  of  /  (x{)  and  g(x  i),  to  produce  B  i  that  is  a  better  approximation  to  H(x\) 
than  the  a  priori  estimate  Bo. 

For  some  of  the  algorithms,  the  search  direction  dk  is  taken  to  be  -Bk~lgk,  but  in  leaving  step  3) 
unspecified  we  provide  for  two  other  cases  that  occur  in  some  of  the  algorithms.  Depending  on  how  Bk  is 
determined  in  step  9),  the  approximate  Hessian  may  not  be  positive  definite,  and  dk  is  then  taken  as 
-{Bk  +  CklT'gk .  for  some  a*  that  makes  Bk  +okl  positive  definite.  Also,  several  algorithms  use  a  tem¬ 
porarily  updated  version  of  Bk  in  computing  dk. 

Algorithm  2.1  includes  a  linesearch  that  satisfies  the  standard  conditions, 

/  (*k)  -f  (**♦!>  £  CLg  (x*)T(x*«.j -xk) ,  (2.1) 

and 

g (xt+i)r(x*+i  - x* )  2  fig  (xk  f(xk+i  -xk).  (2.2) 

We  do  not  detail  how  the  steplength  parameter  ji*  is  to  be  adjusted  so  that  at  any  iterate  conditions  (2.1) 
and  (2.2)  will  be  satisfied  upon  reaching  step  9);  our  implementation  uses  the  procedure  in  Dennis  and 
Schnabel  (1983).  Accumulated  computational  experience  indicates  that,  for  reasonable  values  of  a  and  p, 


these  conditions  are  satisfied  by  the  initial  value  p*  *  1  most  of  the  time.  When  that  condition  occurs,  we 
are  able  to  use  immediately  all  the  information  gathered  in  step  7),  so  that  all  the  algorithms  described 
below,  except  for  Newton’s  method,  require  only  the  time  for  one  concurrent  set  of  function  evaluations 
to  cany  out  the  step. 

We  now  describe,  in  a  natural  order,  the  algorithms  to  be  considered  in  this  paper. 


Quasi-Newton  and  Newton  methods 

We  first  discuss  how  two  standard  algorithms,  Newton’s  method  and  the  BFGS  method,  may  be 
advantageously  implemented  on  parallel  processors.  The  BFGS  method  algorithm,  which  we  designate 
by  "S"  for  "step  update",  is  obtained  from  Algorithm  2.1  as  follows.  The  matrix  B  i  is  taken  to  be  some 
scaled  multiple  of  the  identity  matrix.  No  finite  difference  directions  are  chosen  at  step  4),  and  no  extra 
function  evaluations  are  done  in  step  7).  The  approximate  Hessian  is  determined  in  step  9)  by  the  BFGS 
step  update,  namely 


»  BkskskTBk  ,  ykyj 


(2.3) 


where  sk  =  and  yk  =gt*i~gk-  The  step  is  computed  simply  by  dk  ^-Bk~lgk,  since  Bk  is 
guaranteed  to  be  positive  definite.  This  algorithm,  as  discussed  in  Schnabel  (1986),  near-optimal 
speedup  on  n+1  or  fewer  processors  by  evaluating  the  gradient  at  each  trial  point  in  step  7)  in  parallel 
with  the  evaluation  of  /(z*+i).  Thus,  if  the  line  search  conditions  are  satisfied  byp*«l,andp2n+l, 
that  iteration  of  the  algorithm  requires  only  the  time  of  one  function  evaluation.  However,  this  method 
cannot  use  more  than  n+1  processors.  We  turn,  therefore,  to  consideration  of  methods  that  can  utilize 
(q+lXn+1)  processors  available  when  q  1 1. 


Since  Newton’s  method  is  known  to  be  an  effective  method  in  the  sequential  case,  it  is  natural  to 
consider  a  parallel  version  of  this  algorithm.  We  will  designate  our  version  of  Newton’s  method  by  "N’’. 
In  Newton’s  method  B  j.  as  well  as  each  Bk+\,  is  taken  to  be  the  finite  difference  Hessian  matrix  at  the 

corresponding  iterate.  This  requires  .(w2^3n)  additional  evaluations  of  / (x).  Since  the  true  Hessian 

need  not  be  positive  definite,  sk  is  taken  to  be  ~(Bk  +  akIT1gk,  where  ak  is  chosen,  as  in  Morf  and 
Sorensen  (1983),  so  that  Bk  +  o*f  is  positive  definite  and  well-conditioned.  For  an  efficient  parallel  ver¬ 
sion  of  Newton’s  method,  at  step  7),  in  parallel  with  the  evaluation  of  /(x*4 j)  and  gfo+i),  the  extra 
q(n+l)  processors  are  used  to  compute  some  of  the  elements  of  Then,  if  the  trial  point  xk+\ 

satisfies  conditions  (2.1)  and  (2.2),  at  step  9)  the  remaining  elements  of  the  finite  difference  Hessian  are 
calculated,  perhaps  requiring  many  cycles  of  parallel  function  evaluations.  If  the  number  of  processors  is 

at  least  (”2  +  Sn  f.?).,  then  / .  g ,  and  H  may  be  evaluated  in  one  cycle,  which  would  make  Newton’s 

method  quite  competitive.  However,  Algorithm  2.1(N)  may  be  quite  inefficient  when  n  is  large  relative 
to  q .  since  it  requires  several  cycles  of  parallel  function  evaluations  for  each  iteration. 

If  p  is  between  n+1  and  .22,  there  is  room  for  a  method  which  gathers  more  informa¬ 

tion  per  iteration  than  a  secant  method  but  less  than  Newton's  method.  In  the  rest  of  this  section  we  con¬ 
sider  several  methods  that,  like  the  BFGS  method,  take  only  one  cycle  of  parallel  function  evaluations  for 
each  trial  point,  but  also  utilize  effectively  the  extra  q(n+l)  function  evaluations  per  cycle  to  approxi¬ 
mate  the  Hessian  more  accurately  than  does  the  BFGS  method. 

Finite  difference  update  of  part  of  the  Hessian 

Perhaps  the  most  natural  idea  satisfying  our  goals  is  to  simply  evaluate  as  many  elements  of 
H{xk+ 0  as  possible  using  the  extra  processors,  and  to  update  Bk  to  Bk+\  by  simply  overwriting  the 


appropriate 


components  of  Bk  by  die  dements  of  H  (x*+i).  It  is  interesting  to  note  that  the  PSB  update 

2  m£  (v -Bk  u)uT+u(v -Bk u  )T  uT(v-Bku)uuT 

**1  k  JTJJ  (TO  ' 

is  equivalent  to  overwriting  row  j  and  column  j  of  Bk  with  v  e  Jf",  if  n  is  the  jA  standard  basis  direc¬ 
tion.  Similarly,  if  we  take  Uk+\  to  be  q  columns  of  the  identity  matrix,  and  contains  the  correspond¬ 
ing  columns  of  H(xk+ 1),  then  we  can  determine  Bk+\  that  overwrites  the  appropriate  rows  and  columns  of 
Bk  with  V*+i  by  the  generalized  PSB  formula,  as  in  Schnabel  (1983), 

^*♦1  m&k  +  (V*+i  “B*(/*+i)(£4+i7’14+i)“1(/4+i7’  (2.4) 

+  UmWmTUmT'Wm  -BkUM)T 
-  UM{UMTUMr'(yM  -BkUk+i)TUM(UMTUMr'UMT  . 

Since  this  method  of  determining  £*+i  dearly  need  not  result  in  Bk+ 1  being  positive  definite,  dk 
must  be  calculated  in  the  same  way  as  in  Algorithm  2.1(N).  We  designate  this  method  as  Algorithm 
2.1  (UP),  where  "U"  indicates  that  the  finite  difference  directions  arc  chosen  as  "unit"  vectors,  and  "P" 
indicates  that  the  finite  difference  information  is  incorporated  into  Bk  through  the  "PSB”  update. 

Specifically,  "U”  means  that  we  partition  the  identity  matrix  into  m  =  blocks  of  q  adjacent  columns, 
and  in  step  4)  chose  £/*+ 1  to  be  the  next  block  of  q  columns. 

Algorithm  2.1(UP)  has  several  obvious  theoretical  properties.  First,  it  is  trivial  to  see  that  if /  (x)  is 
a  quadratic  function,  with  Hessian  H ,  then  Algorithm  2.1  (UP)  will  terminate  after  m  steps,  since  after  m 
steps  we  will  have  Bm  *  H .  It  is  also  fairly  easy  to  show  that  this  method  has  a  local  m-step  Q-quadratic 
convergence  rate,  and  also  is  1-step  Q-superlineariy  convergent.  One  disadvantage  of  this  method  is  that, 
even  when  H(xk+\)  is  positive  definite,  it  can  produce  an  indefinite  Bk+\.  Also,  it  is  commonly  accepted 
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that  the  BFGS  method  is  superior  to  the  PSB  method  among  secant  algorithms  for  unconstrained  optimi¬ 
zation,  so  it  is  natural  to  consider  incorporating  the  finite  difference  information  into  Bk  by  means  of  the 
BFGS  update. 

Finite  difference  update  of  part  of  the  Hessian  using  the  BFGS  formula 

We  now  consider  two  methods  that  art  similar  to  Algorithm  2.1  (UP),  but  drat  use  the  BFGS  update 
to  incorporate  the  new  finite  difference  information,  rather  dun  the  PSB.  First,  we  consider  the  method 
obtained  by  simply  modifying  Algorithm  2.1(UP)  by  updating  through  the  generalized  BFGS  formula 
(Schnabel  (1983)),  obtaining 

Bk+X  =  Bk-Bk Uk+m+iTBk VMrxUk^TBk  +  V*+1(Ut+1rV*+,)V*+,r  .  (2.5) 

Since  V*+i  is  a  finite  difference  approximation  to  #(x*+iK4+i,  it  may  be  necessary  to  symmetrize  the 
matrix  Uk+ir  Vt+ 1-  Sufficiently  close  to  a  stria  local  minimum,  the  q  xq  matrix  Uk+?Vk+\  will  be  posi¬ 
tive  definite,  and  thus  will  be  positive  definite  if  Bk  is.  In  the  case  that  t/k+ iTV*+i  is  not  positive 
definite,  we  will  use  a  maximal  subset  of  the  columns  of  Uk+\  that  yields  a  positive  definite  matrix  and 
save  the  rest  of  the  columns  to  use  as  finite  difference  directions  at  the  next  step.  We  designate  this  algo¬ 
rithm  as  Algorithm  2.1(UB),  where  "U"  indicates  that  in  step  4),  £/*+ 1  is  taken  to  be  the  next  block  of  q 
unit  vectors,  and  "B"  indicates  that  in  step  10),  £*+]  is  determined  from  Bk  and  the  finite  difference  infor¬ 
mation  by  using  the  generalized  BFGS  update.  We  include  Algorithm  2.1(UB)  only  to  show  how  much 
its  performance  is  improved  by  an  idea  to  be  discussed  later.  Although  this  method  does  maintain  posi¬ 
tive  definiteness  of  the  matrices  [Bk ),  it  does  not  terminate  in  m  steps  on  a  quadratic  function,  and  it  per¬ 
forms  very  poorly  in  practice. 

However,  by  choosing  die  finite  difference  directions  more  intelligently  we  can  preserve  the  qua¬ 
dratic  termination  property  and  obtain  a  considerably  better  method.  If  we  choose  l/*„i  to  be  orthogonal 


•A*V»V 


to  the  m—  1  previous  matrices  V*_y ,  for  0  £  y  Sm-2,  then  the  finite  difference  directions  asymptotically 
become  block  conjugate  with  respect  to  the  Hessian  of  /.  In  particular,  if  / (z)  is  a  quadratic  function 

with  Hessian  H ,  then  the  sets  of  directions  U  i,  £/j . Um  are  block  conjugate  with  respect  to  H ,  i.e., 

for  IS  /  <j£m,U?HUj»  0.  Thus,  it  is  easy  to  show  that  Bm  »H,  and  so  this  method  terminates  in  m 
steps  on  a  quadratic  function.  We  designate  this  algorithm  as  Algorithm  2.1(CB),  where  "C  indicates 
that  the  finite  difference  directions  are  chosen  as  above  to  be  approximately  conjugate,  and  "B"  indicates 
that  £*+1  is  obtained  from  Bt  and  the  finite  difference  information  by  (2.5).  We  will  show  in  the  next 
section  that  Algorithm  2.1(CB)  is  m-step  Q-quadratically  and  1-step  Q-superlineariy  convergent  Also, 
the  conjugate  method  for  choosing  the  finite  difference  update  directions  is  invariant  under  linear 
transformations,  and,  of  course,  by  updating  through  (2.5),  the  matrices  Bk  are  all  positive  definite. 

Step  update  algorithms 

Algorithm  2.1(UP)  and  Algorithm  2.1(CB)  are  still  somewhat  unsatisfactory,  in  that  they  do  not 
use  the  step  information  contained  in  g*+i  and  g*  in  forming  £*+].  In  fact,  the  computational  results  in 
Section  5  show  that  Algorithm  2.1(S),  the  parallel  BFGS  method,  which  uses  just  the  gradient  differences 
at  successive  iterates  to  approximate  the  Hessian,  performs  better  than  Algorithms  2.1  (UP),  (UB),  and 
(CB),  which  do  not  make  use  of  this  information  and  instead  approximate  the  Hessian  by  using  just  the 
gradients  along  die  finite  difference  directions.  Thus,  this  information  appears  to  be  important,  and  we 
would  like  to  approximate  £*+ 1  in  a  way  that  utilizes  all  of  the  available  gradient  information. 

One  way  to  do  this  would  be  to  simply  update  Bk  twice  at  each  iteration,  along  the  step  direction, 
as  in  standard  secant  methods,  and  along  the  finite  difference  directions.  To  do  this  we  modify  Algo¬ 
rithms  2.1(UP),  (UB),  and  (CB)  as  follows.  In  step  9),  the  counterparts  of  Algorithms  2.1(UB)  and  (CB) 
first  do  the  step  update  (2.3)  using  the  direction  r*  the  counterpart  of  Algorithm  2.1  (UP)  does 


the  step  update 


Bl+1 »  Bk  + 

j*tj* 


tkT(yk-Bksk)skskT 


(skJsk) 


(2.6) 


The  resulting  matrix  in  either  case  is  then  updated  by  the  finite  difference  update  as  in  methods  (UP), 
(UB),  or  (CB).  We  designate  these  algorithms  by  (UPS),  (UBS),  and  (CBS),  where  "S"  indicates  that  a 
step  update  (2.3)  or  (2.6)  is  done.  Note  that  the  extra  update  uses  information  already  available,  so  that 
still  only  one  concurrent  function  evaluation  is  required  per  step. 


This  approach  has  the  drawback  that  there  is  little  hope  of  choosing  the  finite  difference  directions 
effectively  as  was  done  with  (UP)  and  (CB).  In  particular,  we  have  seen  no  way  to  choose  Uk+ j  so  as  to 
maintain  finite  termination  on  a  quadratic  function  or  m-step  quadratic  convergence  on  general  /  (x). 
However,  these  methods  perform  well  in  the  experiments  of  Section  S.  and  dearly  merit  further  con¬ 
sideration. 


Temporary  step  update  algorithms 


A  way  to  preserve  the  progress  made  on  the  Hessian  approximation  by  the  finite  difference  updates 
while  using  the  Hessian  information  along  r*  is  to  only  temporarily  update  Bk  along  the  step  direction. 
More  spedfically,  we  can  modify  each  of  the  algorithms  (UP),  (UB),  and  (CB),  to  obtain  corresponding 
algorithms  designated  (UPT),  (UBT),  and  (CBT),  as  follows.  In  step  9),  is  updated  from  Bk  along 
the  finite  difference  directions  by  (2.5)  or  (2.4).  Then,  in  step  3),  we  either  calculate 


sk‘Bk+iSk  sk‘yk 


for  (UBT)  and  (CBT),  or 


Bt+i  *  Bk+i  +  (yk~Bk*\St)T  _  *tr(y*-£t+iJt)s*s*r 

*4  j*fj*  (sPlTF  * 

for  (UPT),  where  yg  ■  i  -g*.  We  then  use  0*«.i  to  compute  dg  exactly  as  is  used  in  the 
corresponding  algorithm  without  the  temporary  step  update.  The  "T"  in  the  designation  for  these  algo* 
rithms  indicates  the  fact  that  die  matrices  Bg+ 1  obtained  through  the  step  direction  update  are  "temporary" 
in  that  they  are  only  used  for  the  next  step  computation  and  then  never  referenced  again;  i.e.,  is  cal¬ 
culated  from  Bg+ not  from  £*+1-  This  idea  was  suggested  to  us  by  a  related  method  of  Li  (1986)  for 
nonlinear  equations. 

This  approach  has  the  advantage  that  the  theoretical  properties  of  the  unmodified  algorithms  are 
preserved,  since  die  step  update  only  affects  the  next  step,  and  is  not  incorporated  into  Bg.  Thus,  as  we 
will  show.  Algorithm  2.1(UPT)  and  Algorithm  2.1(CBT)  maintain  their  m  -step  Q-quadratic  and  1-step 
Q-superlinear  convergence  rates.  Also,  the  addition  of  the  temporary  step  update  provides  significantly 
improved  performance  over  the  pure  finite  difference  algorithms  in  practice,  as  seen  in  Section  5.  This  is 
perhaps  due  to  the  observation  that,  in  practice,  successive  steps  in  optimization  algorithms  often  tend  to 
lie  along  roughly  the  same  direction,  so  that  in  only  using  the  most  recent  step  direction  in  updating  the 
matrix  to  be  used  to  compute  the  next  step,  most  of  the  relevant  step  information  is  retained. 


3.  Convergence  Analysis. 

In  this  section  we  analyze  the  local  convergence  properties  of  two  of  the  algorithms  discussed  in 
the  previous  section,  namely  Algorithm  2.1(CB)  and  Algorithm  2.1(CBT).  We  also  discuss  some  of  the 
theoretical  properties  of  the  other  algorithms  from  Section  2.  For  simplicity  we  assume  that  the  finite 
difference  values  used  are  exact,  i.e.  that  each  g*«.|  ■  g(x*4)),  and  that  «  H at  step  7)  of 


Algorithm  2.1. 


The  first  two  algorithms  in  Section  2,  parallel  versions  of  the  BFCS  method  and  Newton's  method, 
differ  from  the  standard  sequential  methods  only  in  that  the  new  function  value  and  (some  of)  the  func¬ 
tion  evaluations  for  die  new  finite  difference  derivative  values  are  evaluated  concurrently.  The  sequence 
of  iterates  produced,  and  hence  the  convergence  properties,  are  identical  to  those  of  the  corresponding 
sequential  methods. 

The  first  new  algorithm  discussed  in  Section  2,  Algorithm  2.1  (UP),  simply  overwrites  q  rows  and 
columns  of  the  Hessian  with  accurate  information  at  each  iteration.  Thus  it  clearly  is  m-step  Q- 
quadratically  and  1-step  Q-superiineariy  convergent,  and  terminates  with  the  exact  solution  in  at  most  m 
steps  if  /  is  a  quadratic  functioa  It  is  not  to  be  expected  that  any  of  these  properties  apply  to  Algorithm 
2.1(UB),  which  as  we  mentioned  is  included  only  to  make  some  computational  points  in  Section  5. 

The  next  algorithm.  Algorithm  2.1(CB).  is  not  as  easy  to  analyze  because  the  update  directions  are 
block  conjugate  only  asymptotically.  It  is  straightforward  to  show  that  this  again  leads  to  m-step  termi¬ 
nation  on  quadratics,  since  the  BFGS  update  is  invariant  under  linear  transformations.  The  following 
analysis  shows  that  this  method,  like  Algorithm  2.1  (UP),  is  m-step  Q-quadratically  and  1-step  Q- 
superlinearly  convergent 

We  now  state  the  standard  assumptions  under  which  we  will  prove  the  two  theorems  of  this  section. 
Assumptions  3.1. 

Let  S  be  an  open  subset  of  Rm  and  suppose  that/ (z),  g(x),  and  H(x)  are  continuous  on S.  Lets*  e  S 
be  a  strict  local  minimum  of  /  (x).  Assume  that  H(x)  is  Lipschitz  continuous  in  some  neighborhood  of 
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To  prove  our  m  -step  Q -quadratic  convergence  results,  it  will  be  necessary  to  show  that  the  error  in 
the  approximate  Hessian  at  an  iterate  is  of  the  order  of  the  errors  in  the  previous  m  iterates.  In  showing 
this,  the  following  two  definitions  will  be  useful. 

Definition 

Given  sequences  [xk }  and  {£*},  let 

6*  *max{  1 1  xk+j  -x.  1 1  : 1  Zj  &m) 
and 

Hk  «  max{K :  1  £  j  i  m } . 

The  fol1  owing  lemma  shows  that  if  m  steps  are  taken,  with  approximate  Hessians  that  are  not  too 
ill-conditioned,  and  the  iterates  remain  dose  enough  to  x* ,  then  the  error  in  the  m  th  approximate  Hessian 
is  of  the  order  of  the  errors  in  the  iterates. 

Lemma  3.1. 

Let  fol  and  [Bk]  be  generated  by  Algorithm  2.1(CB),  and  suppose  that /(x)  and  x.  satisfy  Assump¬ 
tions  3.1.  Then  for  any  M  >  0,  there  isanr>0andc>0  such  that  for  any  k ,  if  6*  <  r  and  k*  S, M , 
then  II  «Jc5*. 

Proof.  Let  M  >  0.  Since  H(x)  is  Lipschitz  continuous  on  some  neighborhood  of  x.  and  H.  is  positive 
definite,  there  is  an  r  >  0  and  an  M\  >  0  such  that  if  I  lx  -x.  1 1  <  r  and  I  lx  -x.  1 1  <  r  then 

Li(//(x))i^-.X,(//(x))SAf,,and  ll/f(x)-f/(x)ll  SAf,  I  lx -x  II. 

Consider  any  k ,  and  assume  that  6*  <  r  and  x*  £  M.  In  this  proof  for  notations]  convenience  and 
darity,  we  omit  the  "k+"  expressions  in  all  subscripts,  e.g.  we  write  f  for  Also,  define 

Hj  *H(Xj). 


Consider  1  £j  S  m-1.  By  the  BFGS  update  formula, 

Bj+X-Bj  - Bj Vj+xWjjBj Uj.xT'UjjBj  +Hj+lUj+l(UJ.lTHj+iUj.lr'Uj.lTHJ+x . 

Thus,  for  1  Si  £J, 

Bj+xUi  -  H.  Vi  -  Bj  Ui  -  H.  Vi  - 

BjVj+xWjjBjVj+xr'i  VjjBjVi  -  Vj+{TH.  Vi  +  Vj+ xTH.  Vt  -  £/j+,r //,//,  )  + 
Hj+xVj+xWjjHj+xVj+xTK  UjjH,  Vi  +  Vj^t(H.  - //,)£/, ), 

since  Vj+\T Hi  Vi  *  0.  Thus,  since  I  i  £7/  1 1  =  1  for  all  / , 

II  Bj+xVi -H.Vi  II  S  WBjVi-H.Ui  II  +  1 1  £,£/;- //.£/,  llic(B,) 

+  II//,  -//.  IIk(B>)+  II Hj+x-Hi  1 1 K(Hj+x) 
i  I \BjVi  -H.Vi  11(1  +M)+M  1 1  lx,-  — x.  \\M  +Mx\\Xj+x-Xi  IIM, 

S(l+Af)  I IByt/j  —//•£/,•  II  +A/Afi5o+2nA/i26oSL  (IIByl/,  -//.l/,  II  +5o), 

where  L  *  max  { 1  +  M ,  MM  i  +  2nM  ,2 } .  Thus,  it  is  easy  to  show  by  induction  that  fori  Si  Sm-1, 

1 1  Bm  Vi  -H.Vi  II  SL— '  II  BiUi-H-Ui  II  +  (^TL*)So. 

Also,  for  1  S  i  S  m .  B,  U,  -HiUi  so  that 

WBtVi -H.Vi  M  *  II B,Ui  - HiVi  +HiVi -H.Vi  U 
«  I \Hi  Vi -H.Vi  I  ISM,  I  lx,  -x.  IIS5oM,. 

Thus,  with  cx»Lm(Mx  +  m) ,  \\BmV,  -H.V,  II  S  c  i5o  for  1  S  i  S  m . 

Since  Vm  is  chosen  orthogonal  to  Ht  Vt .  for  1  S  /  S  / ,  if  we  let  V  be  the  n  xn  matrix  obtained  by 
concatenating  the  matrices  U, ,  for  1  S i  S  /n ,  let  V  be  the  matrix  obtained  by  concatenating  the  matrices 
//,£/,,  for  1  Si  S/n,  and  let  D  * VTV,  then  D  is  block  diagonal,  with  diagonal  blocks  U,TH,U,- 


Further,  for  each  i , 


1  • 

since  lit/,  ll*l,so  !!£>->  1 1  £  A/,.  Thus, 

Ntf"'  II  *  II D~xVt  II  £  I  IZ>'*  1 1  IIVr  II  £M\  1 1  Vr  1 1  . 

Note  that 

1 1 VT  1 1  -  IIVTx  1 1  £  n^M i . 

since  for  any  x  with  I  lx  II  =  1.  if  we  let  v  =//(x;)u  be  a  typical  column  of  V,  where  I  In  1 1  =  1.  and 
1  £  /  £  m ,  then  vTx  £  1 1  u  II  ll//(S)ll  llxll£Afj.  So,  I \U~l  1 1  £/i*Af,2 

Now,  each  column  of  (B„  -H.)U  has  norm  less  than  CjSo,  so  dearly  there  is  a  constant  c2  such 
that  11(8. -H.)U  II  £  c25^  Hence, 

118, -//.  II*  ll(B, -//.)LTLT~>  1 1  £  11(8*  -H*)U  1 1  lit/-*  1 1  £c2«V4A/i250 
and  the  desired  result  follows  with  c  *  c2nwA/ 12.  □ 

In  the  proof  of  our  main  local  convergence  result  below  we  make  Are  strong  assumption  that  x>  is  a 
strict  global  minimum  of  /  (x).  We  do  this  in  order  to  rule  out  the  possibility  of  taking  an  unreasonably 
long  step  during  the  first  m  iterations.  Alternatively,  we  could  prove  m  -step  Q-quadratic  convergence 
under  the  assumption  that  B  i  is  sufficiently  close  to  H» . 

Theorem  3.1. 

Suppose  drat  /(x)  and  x.  satisfy  Assumptions  3.1.  Suppose  in  addition  that  x.  is  a  strict  global 
minimum  of  fix).  Then  for  any  positive  definite  matrix  B\ ,  there  is  an  e>0  such  that  if 
I  Ixt  —  x«  II  <e,  then  the  sequence  {x* )  generated  by  Algorithm  2.1(CB)  converges  m-step  Q- 
quadratically  and  1-step  Q-superiineariy  to  x- . 


Proof. 


Let  B ,  be  a  given  positive  definite  matrix. 

We  first  obtain  a  neighborhood  in  which  H(x)  is  well-behaved  and  show  that  in  this  neighborhood, 
the  condition  numbers  k(B  i), . .  „  <BW)  are  bounded. 

Since  H(x)  is  Lipschitz  continuous  on  some  neighborhood  of  x*  and  ff*  is  positive  definite,  there 
are  constants  r,  and  y  such  that  if  I  lx  -x.  II  <r,.  then  X,(H(x))>  0.  IIH(x)ll  £2 1 IH.  II,  and 
ll/fCxr1 1 1  ^2 1 !//.-» II.  and  if  also  llx-x.  II  <r,,then  I  l//(x)-tf(x)  1 1  Syllx-x  II.  Define 

Af,=min{y,2ll/f.  1 1 . 2 1 1  //.  “>  1 1 .  I  IS,  II.  IIS,"1 11.1} . 

Let  rii  >  0  be  such  that  if  IIS  -//.  II  <tj,  for  a  matrix  B .  then  IIS-1  II  SAf,  and  IIS  II  SAf,.  Let 

N\ *  {x  c  S"  :  I  lx  -x.  II  <  r,}. 

It  is  easy  to  show,  in  similar  fashion  to  the  argument  in  Fletcher  (1970),  that  if  H  and  S  arc  posi¬ 
tive  definite  matrices,  U  has  q  independent  columns,  and 

B+  =  B  -  BU(UTBU)~lUTB  +HU(.UtHU)-1UtH  , 

then 

X,(B*)amin{L,(tf).  X,(S)) 

and 

K (S+)  S  max [X. (//).  K (S )) . 

Further,  if  H  =  H (£)  for  some  \  e  Af ,,  then  clearly 
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r.» 


& 


rv* 

§ 


X»(B*)£max{A/,.A/,2A«(B)). 

So,  » trivial  induction  shows  that  if  IIB*+i  II  ZMX,  IIB*+i-1  H  ZMX,  and  5*  Sr,  then  for  IS; Sm, 
\\Bk+j  II  SAfj3>  and  IIB**y-,ll  SAf ,3>.andsoiQ  SA/,6".  LetAf2>Af x6m. 

We  now  construct  a  neighborhood  N2  contained  in  N\  in  which  Lemma  3.1  applies,  and  show  that 
die  iterates  remain  in  this  neighborhood.  By  Lemma  3.1,  there  are  r  and  c  such  that  for  any  k,  if  5 k  <r 

and  k*  S M2,  then  II B^-Af.  II  Sc5*.  Letr2«min{r,,  r,  ^-}.  Take 

F  *min{/ ( x ) :  r2  S  Hz  -x«  II), 

and  let  N2  =  {x  e  S  :/(x)  <  F }.  Then,  since  z.  is  the  stria  global  minimum  of /(z)  and  condition 
(2.1)  is  satisfied,  if  xxe  N2  then  for  all  k,  xk  €  N2.  Thus,  6*  <r2Sr,  for  all  k.  Now.  since 
HB,II  SA/,  and  1 1  ^  j-*  II  SAf KoSA/2,  and  so  II  Bm-H»  II  ScSoSer2  <  tij.  So,  MB*  II  SAf, 
and  I  IB*,-1  II  S  A/,.  Hence,  by  a  simple  induction,  we  have  that  WBj  II  S  A/2  and  I  IB,-1 1 1  s  Af2  for 
ally, and  IIB** -H»  II  Sc5*(*,_j)forallk. 

Thus,  [xj )  is  contained  in  N2  and  k (By)  S  Af  2  for  all  j,  so  Lemma  2.1  of  Byrd  and  Nocedal  (1986) 
implies  that  (z; )  converges  to  z. ,  since 


~gjsk 


gjBr'gk  ^  1 

H g*  II  Hi*  II  llg*  II  IIB*-‘g*  II  '53571  • 

We  have  shown  that  for  all  k,  8*  £  r2  <  r  and  k*  S  A f2.  so  for  any  k,  I  IB***,  -//•  II  S  c5*  <tj,. 
Also,  by  Theorem  6.4  of  Dennis  and  Mori  (1977)  it  follows  that  for  all  large  k ,  p.*  *  1. 

Consider  any  k  with  p*  ■  1.  Then  for  some  between  x***  and 

llxt^+i-x.  II  *  Mzt+W  -z.  -Bt^-'g*^  II 
S  IIB^-MI  IIB*4W(z*^-z.)-//(^)(x*+m-z.)ll 


SAM  I I  +  IIH(x^)  +  tf.  II  )  llx^-x.  II 

SA/j  (c8*  +A/i  1 14*.*,  -x»  11)1  la*.**  -x.  1 1  . 
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Since  g**,  *g(x-)  +  tf  (4*+-Xxi+«  -x»)  for  some  4**,  between  x»  andx**,, 

11^ -x.  Ili  llx**,-x.  11  +  llfl*4--W  II  S(l+A/,2)Ilx**,-x. 


So,  we  have  that 


I lx*4— +i— x*  II  iAflC5*  I  lx**,  -x.  II  +Af,2(l+A/,2)llx**1-x.  II2 
i(A/,c  +A/,2(1+A/i2))5*  I  lx**,  —x*  II  . 

Thus,  by  definition,  (x* )  converges  to  x.  at  an  m-step  Q-quadratic  rate,  and  since  6*  converges  to  0. 
clearly  also  (x* )  converges  to  x.  at  a  1-step  Q-superlinear  rate.  □ 

Theorem  3.2  shows  that  Algorithm  2.1(CBT),  which  adds  a  temporary  BFGS  update  of  the  stan¬ 
dard  secant  information  to  the  partial  Hessian  information  update  of  Algorithm  2.1(CB),  has  the  same 
local  convergence  properties  as  Algorithm  2.1(CB).  This  is  of  interest  because  Algorithm  2.1(CBT)  turns 
out  to  be  the  better  of  the  two  methods  in  practice.  The  technique  of  proof  is  related  to  one  used  by  Li 
(1986)  for  a  related  temporary  update  method  for  solving  systems  of  nonlinear  equations. 


I' 


Theorem  3.2. 

Suppose  that  /(x)  and  x*  satisfy  Assumptions  3.1.  Suppose  in  addition  that  x.  is  a  strict  global 
minimum  of  /(x).  Then  for  any  positive  definite  matrix  B\,  there  is  an  e>0  such  that  if 
llxj-x.  li  <e,  then  the  sequence  (x* )  generated  by  Algorithm  2.1(CBT)  converges  m-step  Q- 
quadratically  and  1-step  Q-superlineariy  to  x* . 

Proof. 

We  prove  this  result  by  indicating  the  necessary  modifications  to  the  proof  of  Theorem  3.1. 
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Fust,  note  that  the  proof  of  Lemma  3.1  need  not  be  changed  at  all  in  order  to  apply  to  Algorithm 
2.1(CBT),  since  the  steps  (s* }  do  not  enter  in  to  the  proof,  and  the  matrices  (B* }  are  unchanged. 

Next,  let  Bj  be  the  approximate  Hessian  with  which  the  step  is  computed,  Le. 


Bj  ~Bj  -  Bjfj-jSj-i7 Bj  +  Vj-iW  ' 
1  1  Sj-iTBjSj-i  sj-i'yj- 1 


Qeariy  IIB;  1 1  £A/j3  II Bj  II  and  llfl,-1  II  &Mi3  MB;-1  II,  so  if  we  take  and  replace 

Bj  by  Bj  throughout  in  the  proof  of  Theorem  3.1,  we  still  obtain  the  fact  that  [xj }  converges  to  x. ,  and 
UB^-H.  II  <Jc5*. 


Thus,  by  Theorem  3.2  of  Broyden,  Dennis,  and  Morg  (1973),  it  is  easy  to  see  that  there  is  a  constant 
Ci  such  that  for  all  k,  I  IBi4m  -H»  1 1  £  Ci8* .  Thus,  since  {8* )  converges  to  0,  we  have  that  for  all  large 
k,c  i5*  <  rii,  and  we  can  finish  the  proof  with  in  place  of  BtMn  and  cj  in  place  of  c.  □ 


A  similar  analysis  could  be  used  to  show  that  Algorithm  2.1(UPT)  is  also  m-step  Q-quadratically 
and  1-step  Q-superlineariy  convergent 


4.  Computer  Implementation. 


In  this  section  we  discuss  our  computer  implementation  of  the  algorithms  described  in  Section  2. 
These  algorithms  are  similar  to  well-known  sequential  algorithms  in  most  respects,  so  we  will  concentrate 
on  the  aspects  that  result  from  a  parallel  processing  context 

We  have  considered  algorithms  that  compute  q  extra  gradients  at  each  point  but  at  the  present  time 
we  have  only  implemented  a  version  that  uses  1  extra  gradient  at  each  point  Because  this  is  the  case  that 
is  reported  on  in  Section  5,  and  because  this  case  is  simpler  to  describe  and  understand,  we  will  restrict 


ourselves  to  this  case  in  this  section. 


Since  we  assume  that  g(x)  and  ff(x)  are  not  available  analytically,  but  must  instead  be  approxi¬ 
mated  using  function  values,  we  need  to  describe  bow  we  approximate  the  various  derivative  quantities 
by  finite  differences.  First,  at  a  point  x ,  we  approximate  g  (x )  by  the  usual  finite  difference  fbnnula  (1.2), 
where 

hi  > (machine precision)*  max  { lx,  1, 1) . 

Then,  we  approximate  the  extra  Hessian  information  v*+i  *  H  (x*+i)u*4i  by 
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factorization  of  V* ,  obtaining  V*  where  Qt  isannxn  orthogonal  matrix  and  Rk  isannxr  upper 

triangular  matrix,  where  t  is  the  number  of  columns  of  Vk  that  were  included  in  the  factorization.  In  the 
course  of  die  factorization,  if  a  column  of  V*  makes  too  snail  of  an  angle  with  the  subspace  spanned  by 
die  previously  included  columns  of  V*,  then  it  is  not  included  in  die  factorization.  Then,  we  take  to 
be  column  r  + 1  of  Qk.  Thus,  u*+i  is  orthogonal  to  all  the  columns  of  V*  that  were  included  in  die  factori¬ 
zation.  Note  that  if  Vk  has  rank  *— 1,  then  14+1  is  orthogonal  to  each  //(x*  for  j  «  0 . re— 2. 

Next,  we  let  v*+i  «N(x*+i)u*+i.  If  Bk+\  is  updated  by  the  PSB,  then  we  obtain  V«+i  by  shifting  the 
columns  of  V*  one  to  the  right  and  taking  column  1  to  be  v*+i.  If  Bk+i  is  updated  by  the  BFGS,  and  the 
update  along  u*+i  succeeded,  then  we  determine  V*+i  as  for  the  PSB  update.  If  the  update  was  not  done, 
because  uf+\vk+\  S  0.  then  V*+i  *  Vk.  Thus,  we  continue  to  use  a  finite  difference  direction  until  the 
update  along  it  is  successful.  Initially,  we  take  Vi  to  be  the  first  n  - 1  columns  of  the  identity  matrix. 
Note  that  the  theory  of  Section  2  is  local,  and  in  that  context  the  matrices  Vt_1  can  be  uniformly  bounded 
and  tf  C**+i)  is  always  positive  definite,  so  the  BFGS  updates  always  succeed. 


We  now  discuss  the  determination  of  B  1  in  the  various  algorithms.  All  the  algorithms,  except  of 
course  for  Newton’s  method,  take  the  initial  Hessian  to  be  the  identity.  Then,  before  the  first  step  direc¬ 
tion  is  calculated,  those  algorithms  that  perform  updates  along  finite  difference  directions  update  the  iden¬ 
tity  matrix  to  a  matrix  B j,  by  equation  (2.4)  or  (Z5),  using  the  finite  difference  information  H(xt)u  j. 
This  information  is  available  at  no  cost  since  the  algorithm  has  to  evaluate  /(xi)  and  g(xO,  so  the  extra 
gradient  information  might  as  well  be  calculated  concurrently  and  used  to  improve  the  initial 

Hessian  approximation.  Then,  B\  is  used  to  compute  s  1.  After  obtaining  xi  and  g  (*2).  we  finally  obtain 
B 1  by  the  scaling 


Bx 


s\B\s  1  5- 

~iwrB 


1  • 


where  yi*g2~ii-  The  effect  of  this  scaling,  first  suggested  by  Orcn  (1974),  is  to  ensure  that 
*frt»i-i&«.  which  ^  desirable  since  we  would  like  to  have  B\S\=y\.  Note  that  our  implementation 
of  the  usual  BFGS  algorithm  also  does  this  initial  scaling  of  B  j.  We  experimented  with  other  strategies 
for  determining  B  i  in  our  new  algorithms,  but  found  no  uniform  improvement  over  the  strategy  described 
above. 

The  implementation  of  these  algorithms  uses  code  for  the  linesearch,  perturbed  Cholesky  factoriza¬ 
tion,  and  stopping  conditions  as  described  in  Dennis  and  Schnabel  (1983).  We  implemented  the  version 
of  their  linesearch  method  that  obtains  a  step  satisfying  conditions  (2.1)  and  (2.2). 

5.  Computational  Results. 

We  now  present  and  discuss  some  computational  results  comparing  the  performance  on  a  set  of 
standard  test  problems  of  the  algorithms  we  have  described.  As  mentioned  in  Section  4,  so  far  we  have 
only  tested  the  versions  of  our  new  algorithms  that  utilize  2(n+ 1)  processors  (i.e.  one  extra  finite  differ¬ 
ence  gradient). 

Our  test  set  is  taken  from  die  standard  set  of  small  dimensional  problems  in  Morf  ,  Garbow,  and 
Hillstrom  (1981).  We  omitted  some  of  their  functions  because  the  problems  were  either  badly  scaled  or 
were  not  solved  by  any  of  our  methods  due  to  floating  point  arithmetic  overflows.  The  15  functions  in 
our  test  set  are  listed  in  abbreviated  form  in  Table  A.  1  in  the  Appendix,  in  die  column  labeled  "func." 
Each  function  was  tested  with  three  choices  of  x  i,  namely  the  standard  starting  point  given  by  Morti,  Gar- 
bow,  and  Hillstrom  (1981),  multiplied  by  1, 10,  and  100.  The  column  in  Table  A.1  labeled  "sp"  contains 
the  multiple  of  the  standard  starting  point  that  was  used  in  the  corresponding  test  problem.  Thus,  our  test 
set  consists  of  42  problems,  since  Watson’s  function  was  only  tested  with  one  starting  point,  the  zero 
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vector,  and  the  Chebyquad  function  from  the  farthest  point  was  not  solved  by  any  of  our  methods. 

The  stopping  conditions  used  in  die  code  are  as  described  in  Dennis  and  Schnabel  (1983).  The 
algorithms  successfully  terminated  when  either  die  relative  size  of  the  gradient  was  less  than  10~s  or  die 
linesearch  failed  to  find  a  tower  point  than  die  current  iterate  while  backtracking.  The  algorithms  failed 
to  solve  a  problem  when  either  the  iteration  limit  of  500  iterations  was  reached  or  a  floating  point  arith¬ 
metic  overflow  occurred,  to  Table  A.1,  an  overflow  on  a  problem  for  a  method  is  indicated  by  "***" 
listed  for  the  number  of  iterations  and  for  the  number  of  failed  trial  points,  while " — "  in  these  same  loca¬ 
tions  indicates  that  the  method  reached  the  iteration  limit  on  the  problem.  Hie  tests  were  performed  on  a 
sequential  machine,  a  VAX  1 1/780,  using  double  precision  arithmetic. 

For  each  problem  that  was  solved,  we  recorded  the  number  of  steps  required  and  the  number  of  trial 
points  at  which  die  linesearch  conditions  (2.1)  and  (2.2)  were  not  satisfied.  These  numbers  are  recorded 
in  Table  A.  1 .  Note  that  the  number  of  steps  and  the  number  of  failed  trial  points  is  enough  information  to 
simulate  the  performance  of  our  algorithms  on  a  parallel  machine  with  2(n+l)  processors,  if  we  assume 
that  only  the  time  for  function  and  derivative  evaluations  is  relevant.  Define  a  cycle  of  parallel  function 
evaluations,  or  "f-cycle,”  to  be  a  step  in  Algorithm  2.1  at  which  up  to  2(n+l)  function  evaluations  are  per¬ 
formed  in  parallel.  Thus,  on  our  simulated  parallel  machine,  an  f-cycle  takes  die  same  amount  of  time  as 
one  function  evaluation.  For  all  of  our  algorithms  except  Newton’s  method,  the  number  of  f-cydes 
required  to  solve  a  problem  is  simply  the  number  of  points  x  at  which  /  (x)  is  calculated  in  steps  1)  and 
7)  in  Algorithm  2.1.  This  is  clear,  since  all  the  function  evaluations  required  for  the  derivative  approxi¬ 
mations  are  performed  in  parallel  with  the  evaluation  of  the  function  at  each  trial  point,  or  are  done  in 
parallel  with  the  evaluation  of  /  (*j).  Note  that  each  trial  point  in  the  linesearch  is  either  an  iterate,  if  it 
satisfies  conditions  (2.1)  and  (2.2),  or  is  a  failed  trial  point,  if  it  does  not  satisfy  these  conditions.  Thus, 
for  all  our  algorithms  except  Algorithm  2.1(N), 


number  of  f  -cycles  «1  +  number  of  iterations  +  number  of  failed  trial  points  . 
Newton's  method  is  somewhat  different,  since  more  than  me  f-cycle  is  required  to  compute  the 
Hessian  at  successful  trial  points.  In  parallel  with  the  computation  of /(x)  and  g(x)  at  trial  points,  the 
extra  n+1  processors  can  clearly  be  used  to  compute  n  +1  function  values  for  the  finite  difference  approxi¬ 
mation  to  H(x).  Then,  if  the  trial  point  satisfies  conditions  (2.1)  and  (22),  the  remaining  —  +  j1  —  1 


function  values  for  the  finite  difference  Hessian  approximation  must  be  computed.  This  requires 


f-cycles.  Hence,  for  Algorithm  2. 1  (N), 


number  of  f-cycles  =(  1  +  number  of  iterations  X  1  + 


+  number  of  failed  trial  points  . 


n*+n  -2 

4<«+l) 


Thus,  for  all  of  our  algorithms  we  can  compute  the  simulated  number  of  f-cycles  needed  to  solve 


each  problem  on  a  parallel  machine  with  2(n+l)  processors,  using  the  raw  data  given  in  Table  A.l.  If 
function  evaluation  is  expensive,  this  is  a  very  dose  indication  of  the  total  cost  of  solving  the  problem. 


We  now  describe  our  statistical  comparisons  of  our  algorithms.  We  present  a  statistical  comparison 
of  all  the  algorithms,  as  well  as  a  number  of  statistical  summaries  comparing  the  relative  performance  of 
pairs  of  die  leading  methods.  Each  of  these  pairwise  summaries  is  in  the  format  of  Table  5.1.  The 
column  headings  give  the  abbreviated  designations  of  the  two  algorithms  being  compared,  and  the  rows 
are  calculated  as  follows.  The  row  labeled  "#  solved"  contains  the  number  of  problems  out  of  the  42  in 
the  test  set  on  which  the  method  successfully  terminated,  while  the  rows  labeled  ”#  overflow"  and  "#  hit 
itnlim"  contain  the  number  of  problems  on  which  each  method  failed,  respectively,  from  floating  point 
arithmetic  overflow  and  by  reaching  iteration  500.  The  row  labeled  "#  compared"  contains  the  number  of 
problems  that  were  solved  by  both  methods.  The  comparative  statistics  in  the  last  two  rows  of  the  tables 
are  computed  only  over  this  set  of  comparison  problems.  The  row  labeled  "Ave.  score"  is  calculated  as 


follows.  For  each  comparison  problem,  the  method  which  required  fewer  f-cycles  is  assigned  a  score  of 


”1,"  while  the  other  method  is  assigned  a  score  equal  to  the  number  of  f-cycles  it  required  divided  by  the 
number  of  f-cycles  required  by  the  better  method.  Then,  foe  average  of  foe  scores  for  each  method  over 
foe  set  of  comparison  problems  is  recorded  in  foe  row  labeled  "Ave.  score."  For  each  comparison  prob¬ 
lem,  if  foe  score  of  a  method  is  less  than  or  equal  to  1.1,  then  that  method  is  counted  as  "#  best"  for  the 
problem,  and  foe  total  of  such  problems  for  each  method  is  recorded  in  foe  row  labeled  "*  best." 


Table  S.l 
p  «2(n+l) 

Algorithm: 

(S) 

(N) 

# solved: 

36 

37 

#  overflow: 

4 

4 

#  hit  imlim: 

2 

1 

#  compared: 

34 

#best: 

19 

20 

Ave.  score: 

1.42 

2.09 

We  first  consider  the  comparative  performance  of  Newton's  method  and  the  BFGS  method.  It 
seems  clear  from  Table  S.l  that  the  BFGS  method  is  somewhat  superior  to  Newton's  method  when 
2(n+l)  processors  are  available.  Table  52,  on  the  other  hand,  compares  these  two  methods  under  the 
assumption  that  p  is  large  enough  that  the  entire  finite  difference  Hessian  approximation  can  be  calcu¬ 
lated  concurrently  with  evaluation  of  the  function  and  gradient  values.  It  shows  that  the  extra  Hessian 
information  available  in  Newton’s  method  substantially  reduces  foe  number  of  iterations  required  in  this 
case.  Together,  these  two  tables  show  that  it  certainly  is  reasonable  to  consider  the  type  of  algorithms 
that  we  are  discussing  in  this  paper,  when  the  number  of  processors  is  large  enough  to  allow  an  extra  gra¬ 
dient  evaluation  at  each  trial  point  but  not  large  enough  to  allow  evaluation  of  the  finite  difference  Hes¬ 
sian  in  one  f-cyde,  because  Newton's  method  is  not  optimal,  but  the  Hessian  information  does  seem  to 


help. 
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Table  S.2 

p  i(nl+5n  +2)/2 


Algorithm; 
»  solved: 

»  overflow: 

♦  hititnlim: 

#  compared: 
tbcst: 

Ave.  score: 


(S)  (N) 

36  37 

_ 4_ _ 4_ 

2  1 

34 

8  27 

3.07  1.36 


We  now  compare  the  relative  performance  of  all  our  algorithms  by  the  following  analysis  tech¬ 
nique.  Define  Af,  to  be  the  number  of  f-cydes  required  for  algorithm  Af  to  solve  problem  i .  If  the  algo¬ 
rithm  fails,  either  due  to  overflow  or  by  exceeding  the  iteration  limit,  then  Af,*  We  then  define  the 
average  performance  a,  for  problem  i  to  be  the  median  of  the  values  Af* ,  over  all  methods  Af ,  except  that 
if  the  median  is  •»  then  we  instead  take  a,  to  be  the  largest  Af,  that  is  not  •• .  Next,  define  the  perfor¬ 


mance  Pu  of  method  Af 


where  if  Af ,  «  then  ^-«0.  This  measure  tends  to  compress  the  performance  measure  of  all  the 
methods  but  is  reasonably  good  at  ordering  them. 

Table  5.3  shows  die  performance  of  each  algorithm  as  measured  by  the  above  technique. 


Algorithm  S  N  UP  UB  CB  UPT  UBT  CBT  UPS  UBS  CBS 

Performance  1.08  1.15  1.16  1.69  1.26  1.14  121  1.07  0.9S  0.98  0.90 

We  can  make  tome  interesting  observations  from  this  table.  Hist,  we  note  that  Algorithms  2.1  (UP),  (UB). 
and  (CB).  which  omit  the  standard  secant  equation,  perform  worse  than  die  BFGS  method.  Algorithm 
2.1(S),  even  though  they  use  some  finite  difference  Hessian  information.  Also,  note  that  Algorithm 
2.1(UB)  performs  much  worse  than  Algorithm  2.1(CB),  which  indicates  die  importance  of  choosing  the 
finite  difference  update  directions  to  be  approximately  conjugate  when  using  the  BFGS  update  to  insert 
finite  difference  Hessian  information.  Next,  we  see  that  die  addition  of  the  temporary  step  update  is 
dearly  worthwhile,  since  each  of  Algorithms  2.1(UPT),  (UBT),  and  (CBT)  performs  better  than  the 
corresponding  algorithm  without  die  temporary  step  update.  However,  two  of  these  methods  still  perform 
worse  than  the  BFGS  method,  and  Algorithm  2.1  (CBT)  performs  about  the  same  as  the  BFGS  method. 
Finally,  we  observe  that  the  methods  that  use  alternating  step  and  finite  difference  updates.  Algorithms 
2.1(UPS),  (UBS),  and  (CBS),  perform  the  best  of  our  algorithms.  They  perform  significantly  better  than 
the  BFGS  method. 

In  Table  5.4  and  Table  55  we  give  pairwise  comparisons  of  the  BFGS  method.  Algorithm  2.1(S), 
with  die  three  best  of  our  new  methods,  Algorithms  2.1  (UPS),  (UPS),  and  (CBS).  These  statistics 
confirm  the  conclusion  that  the  alternating  update  methods  perform  better  than  die  BFGS  method.  Also, 
it  is  interesting  to  note  that  the  choice  of  approximately  conjugate  finite  difference  update  directions  that 
yielded  such  an  improvement  of  Algorithm  2.1(CB)  over  Algorithm  2.1(UB)  does  not  give  a  similar 
improvement  for  these  methods,  since  Algorithms  2.1  (UBS)  and  (CBS)  perform  very  similarly.  This  is 
probably  linked  to  the  fact  that  the  m-step  quadratic  convergence  of  methods  (CB)  and  (CBT),  which 


depended  on  using  conjugate  rather  ban  orthogonal  directions,  is  destroyed  by  making  permanent  step 
updates.  Finally,  we  see  that  the  conventional  wisdom  that  the  BFGS  update  is  in  some  sense  superior  to 
the  PSB  update  is  supported  here,  since  Algorithms  2.1(UBS)  and  (CBS)  perform  somewhat  better  than 
Algorithm  2.1(UPS). 


Tables  5.4  and  5-5  indicate  that  our  best  new  methods  have  achieved  about  a  30%  improvement 
over  the  BFGS  method  when  there  are  twice  the  number  of  processors  needed  by  the  BFGS  method.  This 
is  not  a  perfect  utilization  of  processors  but  may  be  about  as  well  as  one  can  do  on  these  small  problems, 
especially  considering  the  comparisons  in  Table  5 2  which  show  a  fairly  small  improvement  by  Newton's 
method  over  the  BFGS  method  on  these  problems. 
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6.  Conclusions. 

We  have  introduced  three  types  of  new  algorithms  that  utilize  extra  function  evaluations  to  obtain 
part  of  the  finite  difference  Hessian  at  each  iteration.  The  first  type  (Algorithms  2.1  (UP).  (UB).  and  (CB)) 
uses  the  finite  difference  Hessian  information  to  update  the  Hessian  approximation  and  omits  the  standard 
secant  update  that  is  made  by  methods  such  as  die  BFGS  method.  The  second  type  (Algorithms 
2.1(UPT),  (UBT),  and  (CBT))  uses  the  finite  difference  Hessian  information  and  makes  a  temporary  stan¬ 
dard  secant  update  as  well.  The  third  type  (Algorithms  2.1  (UPS),  (UBS),  and  (CBS))  makes  updates  both 
with  the  finite  difference  Hessian  information  and  with  the  standard  secant  information  at  each  iteration. 
Each  algorithm  type  has  three  variants:  using  standard  basis  finite  difference  directions  with  PSB  updates 
(first  two  letters  UP)  or  BFGS  updates  (first  two  letters  UB).  or  using  conjugate  finite  difference  direc¬ 
tions  with  BFGS  updates  (first  two  letters  CB). 


We  have  shown  m-step  Quadratic  and  1-step  Q-superlinear  convergence  rates  for  Algorithms 


2.1(CB)  and  (CBT);  the  same  results  clearly  hold  for  Algorithms  2.1  (UP)  and  (UPT).  Here  m 


where  q  is  the  number  of  extra  gradient  evaluations  available  per  iteration.  Our  experimental  results  with 
q  *  1  show  that  of  these  algorithms,  only  Algorithm  2.1  (CBT)  performs  roughly  as  well  as  the  BFGS 
method.  These  algorithms  may  perform  better  than  the  BFGS  method  when  q  >  1  or  on  large  dimen¬ 
sional  problems;  we  plan  to  experiment  with  these  cases. 


Algorithms  2.1(UPS),  (UBS),  and  (CBS)  with  q  « 1  appear  to  perform  better  than  the  BFGS 
method.  We  have  not  been  able  to  show  m  -step  Q -quadratic  convergence  for  these  methods  and  suspect 
that  they  do  not  possess  this  property.  However,  they  appear  to  be  a  promising  approach  to  utilizing  extra 
processors  in  solving  unconstrained  optimization  problems.  Also,  we  believe  that  these  methods  are  1- 


step  Q-superiineariy  convergent  We  intend  to  continue  to  experiment  with  these  methods  with  q  >  1 
and  on  larger-dimensional  problems,  and  to  analyze  their  convergence  properties. 
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Appendix. 


Table  A.1 


—  *  cwrcrfio*,  —  ■  hemim  limit 
Method 


S 

N 

UP 

UB 

CB 

UPT 

UBT 

CUT 

UPS 

UBS 

CBS 

fODC 

a 

* 

heniicni 

Ceiluru 

HELI 

3 

1 

27 

10 

19 

58 

21 

21 

23 

23 
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7 
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7 

9 

4 

6 

20 

22 

6 

6 
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1 

27 
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35 

33 
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17 

17 

1 
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10 
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8 

12 

0 
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convergence  of  several  of  these  methods.  We  also  present  computational  results  which 
illustrate  their  performance  on  parallel  computers  when  function  evaluation  is 
expensive . 
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